Towards Integrated Acoustic Models for Speech Synthesis
Authors
Abstract
All statistical parametric speech synthesizers consist of a linear pipeline of components: data fed into the synthesizer passes through the front-end, then the prediction algorithm, then waveform generation, and so on until the speech is finally constructed. Each component in this pipeline naively receives a stream of numbers from the preceding component and emits a stream of numbers for the next one in line, with little to no knowledge of what happens in the larger scheme of the pipeline. In this thesis, I argue against this “Markovian” structure and instead propose an integrated structure, in which every component of the system influences, and is in turn influenced by, every other component. This thesis describes four sets of experiments that move towards this idea. The first uses lexical information to improve waveform generation algorithms. The second increases the interaction between prediction algorithms and waveform generation. The third attempts to derive phonemes and phonetic information automatically from the speech rather than from the text. The last, and probably hardest, describes an evaluation metric that pays attention to multiple components of the synthesizer rather than focusing on just a single one.
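The linear pipeline described above can be sketched in a few lines. This is a hypothetical toy illustration, not an actual synthesizer: the stage names, feature representations, and numeric mappings are all invented for the sketch, and the point is only the strictly feed-forward data flow that the abstract argues against.

```python
# Toy sketch of a linear ("Markovian") SPSS pipeline.
# Each stage sees only the output of the stage before it and has no
# knowledge of, or influence on, any other part of the system.

def front_end(text):
    # Text analysis: map raw text to a sequence of linguistic features.
    # (Real front-ends produce phones, stress, POS tags, etc.)
    return [{"phone": ch, "position": i} for i, ch in enumerate(text)]

def prediction(linguistic_features):
    # Acoustic model: map linguistic features to acoustic parameters.
    # Here each feature becomes a one-dimensional "parameter vector".
    return [[float(f["position"])] for f in linguistic_features]

def waveform_generation(acoustic_params):
    # Vocoder: map acoustic parameters to waveform samples.
    return [p[0] * 0.1 for p in acoustic_params]

def synthesize(text):
    # Strictly feed-forward composition: no stage can feed information
    # back to an earlier one, which is the structure the thesis rejects.
    return waveform_generation(prediction(front_end(text)))
```

An integrated structure, by contrast, would let information flow in both directions between these stages, e.g. waveform-level errors influencing the front-end.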
Similar resources
Prosodic models and speech synthesis: towards the common ground
Prosodic models have been extensively applied in speech synthesis. However, the necessity of synthesizing prosody has as yet not resulted in a generally agreed upon approach to prosodic modeling. This statement holds for the assignment of segmental durations as well as for generating F0 curves, the acoustic correlate of intonation contours. This paper concentrates on the use and usability of in...
First Step Towards End-to-End Parametric TTS Synthesis: Generating Spectral Parameters with Neural Attention
In conventional neural networks (NN) based parametric text-to-speech (TTS) synthesis frameworks, text analysis and acoustic modeling are typically processed separately, leading to some limitations. On one hand, much significant human expertise is normally required in text analysis, which presents a laborious task for researchers; on the other hand, training of the NN-based acoustic models still ...
Towards Articulatory Speech Synthesis with a Dynamic 3D Finite Element Tongue Model
We describe work towards articulatory speech synthesis driven by realistic 3D tissue and bone models. The vocal tract shape is modeled using a fast 3D finite element method (FEM) of a muscle-activated human tongue in conjunction with fixed rigid models of jaw, hyoid and palate connected to a deformable mesh representing the airway. Actuation of the tissue model deforms the airway providing a ti...
Artisynth: an extensible, cross-platform 3d articulatory speech synthesizer
We describe our progress on the construction of a combined 3D face and vocal tract simulator for articulatory speech synthesis called ArtiSynth. The architecture provides six main modules: (1) a simulator engine and synthesis framework, (2) a two and three-dimensional model development component, (3) a numerics engine, (4) a graphical renderer, (5) an audio synthesis engine and (6) a graphical ...
Acoustic and Visual Analysis of Expressive Speech: A Case Study of French Acted Speech
Within the framework of developing an expressive audiovisual speech synthesis, an acoustic and visual analysis of expressive acted speech is proposed in this paper. Our purpose is to identify the main characteristics of audiovisual expressions that need to be integrated during synthesis to provide believable emotions to the virtual 3D talking head. We conducted a case study of a semi-profession...